The InteractiveOceans Data Portal¶
The InteractiveOceans Data Portal is an interface designed to enable scientists, educators, and students to quickly locate and plot cabled, real-time Ocean Observatories Initiative (OOI) data produced by the Regional Cabled Array (RCA). The goals of this site are:
to increase active use of RCA data by scientists and to support educators and public exploration of data in the future;
to provide additional tools for scientists to discover, access, visualize, and use RCA data sets suitable for addressing specific science hypotheses;
to provide an intuitive, user-friendly data search and visualization interface, coupled with a convenient data downloading scheme;
to accelerate research output and engage a broader user base for OOI data.
The dashboard provides advanced search capabilities and links to data visualization tools to quickly create plots of various types for streaming data.
Because these datasets can be quite large, we use tools like data shading to quickly render a decimated version of the data while retaining the meaningful structure of the time series.
How Interactive Oceans Displays Large Datasets¶
This notebook walks you through how we display some of the large datasets from OOI.
import cmocean
from dask.utils import memory_repr
import matplotlib.pyplot as plt
import hvplot.xarray
from ooi_harvester.models import OOIDataset
Get Data¶
We will request CTD data from the Axial Base Shallow Profiler.
desired_parameters = ['time', 'seawater_pressure', 'seawater_temperature']
ctd = OOIDataset("RS03AXPS-SF03A-2A-CTDPFA302-streamed-ctdpf_sbe43_sample")[desired_parameters]
ctd
<RS03AXPS-SF03A-2A-CTDPFA302-streamed-ctdpf_sbe43_sample: 52.2 GB>
Dimensions: (time)
Data variables:
seawater_pressure
seawater_temperature
time
This dataset has a total size of 52.2 GB.
start_dt, end_dt = "2020-01-01", "2021-01-01"
%%time
ctd_ds = ctd.sel(time=slice(start_dt, end_dt)).dataset
CPU times: user 14.9 s, sys: 9.09 s, total: 24 s
Wall time: 33.6 s
ctd_ds
<xarray.Dataset>
Dimensions: (time: 29317301)
Coordinates:
* time (time) datetime64[ns] 2020-01-01T00:00:00.235197952...
Data variables:
seawater_pressure (time) float64 dask.array<chunksize=(4868438,), meta=np.ndarray>
seawater_temperature (time) float64 dask.array<chunksize=(4868438,), meta=np.ndarray>
- time: 29317301
- time (time) datetime64[ns] 2020-01-01T00:00:00.235197952 .....
  - axis: T
  - long_name: time
  - standard_name: time
  array(['2020-01-01T00:00:00.235197952', '2020-01-01T00:00:01.234995712', '2020-01-01T00:00:02.235314688', ..., '2020-12-31T23:59:57.987104256', '2020-12-31T23:59:58.986696704', '2020-12-31T23:59:59.986810880'], dtype='datetime64[ns]')
- seawater_pressure (time) float64 dask.array<chunksize=(4868438,), meta=np.ndarray>
  - ancillary_variables: seawater_pressure_qartod_results seawater_pressure_qartod_executed pressure pressure_temp
  - comment: Seawater Pressure refers to the pressure exerted on a sensor in situ by the weight of the column of seawater above it. It is calculated by subtracting one standard atmosphere from the absolute pressure at the sensor to remove the weight of the atmosphere on top of the water column. The pressure at a sensor in situ provides a metric of the depth of that sensor.
  - coordinates: lat depth lon time
  - data_product_identifier: PRESWAT_L1
  - long_name: Seawater Pressure
  - precision: 3
  - standard_name: sea_water_pressure
  - units: dbar
          Array         Chunk
  Bytes   223.67 MiB    99.99 MiB
  Shape   (29317301,)   (13106200,)
  Count   17 Tasks      3 Chunks
  Type    float64       numpy.ndarray
- seawater_temperature (time) float64 dask.array<chunksize=(4868438,), meta=np.ndarray>
  - ancillary_variables: seawater_temperature_qartod_results seawater_temperature_qartod_executed temperature
  - comment: Seawater temperature near the sensor.
  - coordinates: lat depth lon time
  - data_product_identifier: TEMPWAT_L1
  - long_name: Seawater Temperature
  - precision: 4
  - standard_name: sea_water_temperature
  - units: degree_C
          Array         Chunk
  Bytes   223.67 MiB    99.99 MiB
  Shape   (29317301,)   (13106200,)
  Count   17 Tasks      3 Chunks
  Type    float64       numpy.ndarray
There are about 29 million data points within that time range. This is huge for visualization!
We can check the in-memory size of this one-year subset:
print(f"This dataset size is {memory_repr(ctd_ds.nbytes)}")
This dataset size is 671.0 MB
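As a rough sanity check, the reported size can be reproduced by hand. The sketch below assumes that the subset holds exactly three arrays (time plus the two requested variables) of 8-byte values, which is consistent with the float64 and datetime64[ns] dtypes shown earlier. The constant names are illustrative only.

```python
# Back-of-the-envelope check of the in-memory size reported above.
N_SAMPLES = 29_317_301   # length of the time dimension in the 2020 subset
N_ARRAYS = 3             # assumed: time, seawater_pressure, seawater_temperature
BYTES_PER_VALUE = 8      # float64 and datetime64[ns] are both 8 bytes wide

total_bytes = N_SAMPLES * N_ARRAYS * BYTES_PER_VALUE
per_array_mib = N_SAMPLES * BYTES_PER_VALUE / 2**20
total_mib = total_bytes / 2**20

print(f"{per_array_mib:.2f} MiB per array, {total_mib:.1f} MiB in total")
```

Under these assumptions the numbers agree with both the 223.67 MiB listed per array and the 671.0 total, which suggests that the printed "MB" value is computed in binary units (2**20 bytes).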
Plotting¶
Now let’s create a depth plot that shows temperature as a function of time and pressure. We use hvPlot for the plotting, because a tool like Matplotlib takes a very long time to draw this many points.
Using matplotlib¶
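# Draw on an explicit Axes so the later axis calls target the same plot.
fig, ax = plt.subplots()
ctd_ds.plot.scatter(x='time', y='seawater_pressure', hue='seawater_temperature', cmap=cmocean.cm.thermal, ax=ax)
# Pressure increases with depth, so invert the y-axis to put the surface at the top.
ax.invert_yaxis()
ax.set_title('Axial Base Shallow Profiler CTD')
plt.tight_layout()
plt.savefig('ctd-profile.png', dpi=300, bbox_inches='tight', transparent=True)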
For comparison, the plot above was created with Matplotlib through xarray's built-in plotting function.
Using hvPlot¶
plot_size = (888, 450)
%%time
plot = ctd_ds.hvplot.scatter(
    x='time',
    y='seawater_pressure',
    color='seawater_temperature',
    rasterize=True,
    cmap=cmocean.cm.thermal,
    width=plot_size[0],
    height=plot_size[1],
).options(
    invert_yaxis=True,
    title='Axial Base Shallow Profiler CTD'
)
plot
CPU times: user 6.04 s, sys: 1.8 s, total: 7.84 s
Wall time: 11.9 s
The hvPlot Python library is part of the HoloViz ecosystem of Python visualization tools. Under the hood, hvPlot uses HoloViews and Datashader to create the plot. We then take the data behind the hvPlot plot and serialize it to JSON so that our frontend visualization engine, Plotly, can render it.
The resulting datashaded plot shows the same pattern as the Matplotlib plot. For example, both show warmer water near the surface around September 2020. This agreement indicates that datashading preserves the structure of the data.
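The JSON serialization step mentioned above could look roughly like the following sketch. It is not the portal's actual code: the function name `grid_to_heatmap_json`, the choice of a Plotly heatmap trace, and the two temperature values are assumptions made for illustration. The one detail worth noting is that standard JSON has no NaN literal, so empty grid cells need to be converted to null before they are sent to a browser.

```python
import json
import math


def grid_to_heatmap_json(x, y, z):
    """Serialize an aggregated grid as a Plotly-style heatmap trace (hypothetical helper).

    x: column coordinates (for example ISO time strings)
    y: row coordinates (for example pressure in dbar)
    z: rows of cell values, where empty cells are float('nan')
    """
    # Standard JSON has no NaN literal, so empty cells become null.
    clean_z = [
        [None if isinstance(v, float) and math.isnan(v) else v for v in row]
        for row in z
    ]
    trace = {"type": "heatmap", "x": list(x), "y": list(y), "z": clean_z}
    # allow_nan=False makes json.dumps raise if a NaN slipped through.
    return json.dumps(trace, allow_nan=False)


# Tiny 2 x 2 grid; the temperature values are made up for illustration.
payload = grid_to_heatmap_json(
    x=["2020-01-01T04:56:45", "2020-01-01T14:50:16"],
    y=[3.796447, 4.216927],
    z=[[float("nan"), 8.1], [7.9, float("nan")]],
)
print(payload)
```

Because the payload size depends only on the grid dimensions, it stays the same no matter how many raw samples were aggregated into it.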
Extracting underlying dataset¶
plot_data = plot[()].data
plot_data
<xarray.Dataset>
Dimensions: (seawater_pressure: 450, time: 888)
Coordinates:
* time (time) datetime64[ns] 2020-0...
* seawater_pressure (seawater_pressure) float64 ...
Data variables:
time_seawater_pressure seawater_temperature (seawater_pressure, time) float64 ...
- seawater_pressure: 450
- time: 888
- time (time) datetime64[ns] 2020-01-01T04:56:45.640463 ... 2...
  array(['2020-01-01T04:56:45.640463000', '2020-01-01T14:50:16.450994000', '2020-01-02T00:43:47.261525000', ..., '2020-12-30T23:16:12.960483000', '2020-12-31T09:09:43.771014000', '2020-12-31T19:03:14.581545000'], dtype='datetime64[ns]')
- seawater_pressure (seawater_pressure) float64 3.796 4.217 4.637 ... 192.2 192.6
  array([ 3.796447, 4.216927, 4.637408, ..., 191.751237, 192.171717, 192.592198])
- time_seawater_pressure seawater_temperature (seawater_pressure, time) float64 nan nan nan nan ... nan nan nan nan
  array([[nan, nan, nan, ..., nan, nan, nan], [nan, nan, nan, ..., nan, nan, nan], [nan, nan, nan, ..., nan, nan, nan], ..., [nan, nan, nan, ..., nan, nan, nan], [nan, nan, nan, ..., nan, nan, nan], [nan, nan, nan, ..., nan, nan, nan]])
The xarray dataset shown above is the resulting aggregated data from the datashading process that we push to the frontend application.
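To make the aggregation step more concrete, here is a minimal pure-Python sketch of rasterizing scattered points onto a fixed grid. It assumes a per-cell mean as the aggregation (the aggregator actually used is not stated here) and is not how Datashader is implemented. The function name `rasterize_mean` and the sample values are hypothetical.

```python
import math


def rasterize_mean(xs, ys, values, width, height):
    """Average `values` over a width x height grid spanning the data extent."""
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    # Guard against a zero-width extent (all points sharing one coordinate).
    x_span = (x_max - x_min) or 1.0
    y_span = (y_max - y_min) or 1.0

    sums = [[0.0] * width for _ in range(height)]
    counts = [[0] * width for _ in range(height)]
    for x, y, v in zip(xs, ys, values):
        # Map each point to a cell; clamp the maximum onto the last cell.
        col = min(int((x - x_min) / x_span * width), width - 1)
        row = min(int((y - y_min) / y_span * height), height - 1)
        sums[row][col] += v
        counts[row][col] += 1

    # Cells that received no points stay NaN, as in the output above.
    return [
        [s / c if c else math.nan for s, c in zip(sum_row, count_row)]
        for sum_row, count_row in zip(sums, counts)
    ]


# Six synthetic samples reduced to a 2 x 2 grid.
xs = [0.0, 0.1, 0.2, 0.9, 1.0, 1.0]
ys = [0.0, 0.1, 0.0, 1.0, 0.9, 1.0]
values = [10.0, 12.0, 14.0, 4.0, 5.0, 6.0]
grid = rasterize_mean(xs, ys, values, width=2, height=2)
print(grid)
```

The output has one value per grid cell regardless of how many input points there are, and cells with no samples remain NaN. At the plot size used here (888 by 450), each grid column covers roughly 9.9 hours of the 2020 record, which corresponds to about 33,000 raw samples per column on average.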
That’s all. We apply this process to every dataset whenever the requested data is large enough.